Welcome back to deep learning! Today I want to tell you a bit about attention mechanisms and how they can be employed to build better deep neural networks.
Okay, so let's focus our attention on attention and attention mechanisms.
So what is attention?
So you see that we humans process data actively by shifting our focus.
We can focus on specific parts of images because different parts carry different information, and for some words we can derive the correct meaning only by means of context. Therefore, we want to shift our attention, and the way we shift it may also change the interpretation.
We also want to remember specific related events from the past in order to inform a certain decision. This then allows us to follow one thought at a time while suppressing information that is irrelevant for the task.
So this is the idea you only want to focus on the relevant information.
One example that you could think of is the cocktail party problem: many different people are talking, and you focus on just one person. By specific means, for example by looking at that person's lips, you can focus your attention, and you can also use your stereo hearing as a kind of beamformer to listen only in that particular direction. This way you are able to concentrate on a single person, the person you are talking to, using this kind of attention mechanism. And we do that quite successfully, because otherwise we would be completely incapable of communicating during a cocktail party.
So what is the idea?
Well, you have already seen those saliency maps, and you could argue that we only want to look at the pixels that are relevant for our decision in order to make that decision.
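As a reminder, such a saliency map can be computed by backpropagating a class score to the input pixels. Here is a minimal sketch in PyTorch; the choice of a pretrained torchvision classifier is illustrative and not part of the lecture:

```python
# Minimal saliency-map sketch in the style of Simonyan et al.
# Assumes `image` is a normalized tensor of shape (1, 3, H, W).
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

def saliency_map(image):
    x = image.clone().requires_grad_(True)
    scores = model(x)
    top_score = scores.max()      # score of the predicted class
    top_score.backward()          # gradient of that score w.r.t. the pixels
    # Per-pixel relevance: maximum absolute gradient over the color channels.
    return x.grad.abs().max(dim=1)[0]
```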
In fact, we will not start with images; we first want to talk about sequence-to-sequence models. Here you can see a visualization of gradients from a CNN-type model that is used to translate from English to German. If you plot those gradients, using essentially the visualization technique that we already looked at for image processing, you can see the gradient of each particular output with respect to each particular input.
If you do so, you can notice that in most cases there is an essentially linear relationship, because English and German of course have very similar word order. But then you see that the beginning of the sequence, which starts with "to reach the official residence of Prime Minister Nawaz Sharif", is translated to "die offizielle Residenz des Premierministers Nawaz Sharif zu erreichen". So "zu erreichen" means "to reach", but "to reach" comes first in the English sentence and goes last in the German one. So there is a long temporal distance between the two, yet these two words essentially translate into those two.
So we can use the information generated by this gradient backpropagation to figure out which parts of the input sequence are related to which parts of the output sequence.
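To make this concrete, here is a hedged sketch of how such an input-output relevance matrix could be computed: for every output position, we backpropagate the score of the predicted token to the input embeddings and take the gradient norm per input position. The model interface used here (`embed`, `forward_from_embeddings`) is an assumption for illustration, not a real library API:

```python
import torch

def input_output_relevance(model, src_tokens):
    # Hypothetical interface: embed the source, then run the rest of the
    # model from those embeddings so gradients can flow back to them.
    emb = model.embed(src_tokens).detach().requires_grad_(True)  # (T_in, d)
    logits = model.forward_from_embeddings(emb)                  # (T_out, V)
    rows = []
    for t in range(logits.size(0)):
        model.zero_grad()
        emb.grad = None
        score = logits[t].max()             # score of the chosen output token
        score.backward(retain_graph=True)   # gradient w.r.t. all input embeddings
        rows.append(emb.grad.norm(dim=1))   # one relevance value per input word
    return torch.stack(rows)                # (T_out, T_in) relevance matrix
```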
Now the question is: how can we model this in order to boost performance? Let's look at a typical translation model.
In a typical sequence-to-sequence model built with recurrent neural networks, you have a forward pass through an encoder network. The encoder receives some input sequence, and from this input sequence, as we discussed earlier, we can compute the hidden states h1 to hT. So we have to process the entire sequence in order to compute hT, and then hT is used as the context vector for the decoder network; hT is essentially the representation of the entire input sequence.
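As a small illustration of this setup, here is a compact PyTorch sketch of such an encoder-decoder model; the GRU cells and layer sizes are illustrative assumptions, not the specific model from the lecture:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # h_all holds h1 ... hT; hT summarizes the entire input sequence.
        h_all, h_T = self.encoder(self.src_emb(src))
        # The decoder sees the input only through this single context vector.
        dec_states, _ = self.decoder(self.tgt_emb(tgt), h_T)
        return self.out(dec_states)
```

Note that the decoder accesses the input sequence only through this one fixed-size vector hT, which is exactly the bottleneck that attention mechanisms are designed to relax.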